Hotelling T2
###########################################
To detect outliers across the multidimensional space of PCA, the **hotellings T2** test is implemented. 
It works by computing the **chi-square tests** across the top n_components: default is PC1 to PC5. The reason for not using more PC's is that it is expected that the highest variance (and thus the outliers) will be seen in the first few components. Going deeper into PC space may therefore not be required. Nevertheless, the depth is optional. 

The **hotellings T2** test outputs a P-value matrix (samples x PCs), which are are subsequently combined using **fishers method**. This allows to determine outliers with its ranking (strongest to weak). The alpha parameter determines the detection of outliers (default: 0.05).


.. code:: python

	# Import libraries
	from pca import pca
	import pandas as pd
	import numpy as np

	# Create dataset with 100 samples
	X = np.array(np.random.normal(0, 1, 500)).reshape(100, 5)
	# Create 5 outliers
	outliers = np.array(np.random.uniform(5, 10, 25)).reshape(5, 5)
	# Combine data
	X = np.vstack((X, outliers))

	# Initialize model. Alpha is the threshold for the hotellings T2 test to determine outliers in the data.
	model = pca(alpha=0.05, detect_outliers=['ht2', 'spe'])

	# Fit transform
	results = model.fit_transform(X)

	# [pca] >Column labels are auto-completed.
	# [pca] >Row labels are auto-completed.
	# [pca] >The PCA reduction is performed to capture [95.0%] explained variance using the [5] columns of the input data.
	# [pca] >Fitting using PCA..
	# [pca] >Computing loadings and PCs..
	# [pca] >Computing explained variance..
	# [pca] >Number of components is [4] that covers the [95.00%] explained variance.
	# [pca] >The PCA reduction is performed on the [5] columns of the input dataframe.
	# [pca] >Fitting using PCA..
	# [pca] >Computing loadings and PCs..
	# [pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[4]
	# [pca] >Outlier detection using SPE/DmodX with n_std=[2]


The information regarding the outliers are stored in the dict 'outliers' (see below). The outliers computed using **hotelling T2** test are the columns **y_proba**, **y_score** and **y_bool**. The outliers computed using **SPE/DmodX** are the columns **y_bool_spe**, **y_score_spe**, where **y_score_spe** is the euclidean distance of the center to the samples. The rows belong the same rows on the input samples.


.. code:: python

	print(results['outliers'])
	# or
	print(model.results['outliers'])


	#            y_proba      y_score  y_bool  y_bool_spe  y_score_spe
	# 1.0   9.799576e-01     3.060765   False       False     0.993407
	# 1.0   8.198524e-01     5.945125   False       False     2.331705
	# 1.0   9.793117e-01     3.086609   False       False     0.128518
	# 1.0   9.743937e-01     3.268052   False       False     0.794845
	# 1.0   8.333778e-01     5.780220   False       False     1.523642
	# ..             ...          ...     ...         ...          ...
	# 1.0   6.793085e-11    69.039523    True        True    14.672828
	# 1.0  2.610920e-291  1384.158189    True        True    16.566568
	# 1.0   6.866703e-11    69.015237    True        True    14.936442
	# 1.0  1.765139e-292  1389.577522    True        True    17.183093
	# 1.0  1.351102e-291  1385.483398    True        True    17.319038


SPE/Dmodx
###########################################
Outlier can be detected using SPE/DmodX (distance to model) based on the mean and covariance of the first 2 dimensions of X.
On the model plane (SPE ≈ 0). Note that the SPE or Hotelling’s T2 are complementary to each other.

.. code:: python

	model.biplot(legend=True, SPE=True, HT2=True)
	model.biplot3d(legend=True, SPE=True, HT2=True)

	# Create only the scatter plots
	model.scatter(legend=True, SPE=True, HT2=True)
	model.scatter3d(legend=True, SPE=True, HT2=True)
    

.. |figO1| image:: ../figs/outliers_biplot_spe_hot.png
.. |figO2| image:: ../figs/outliers_biplot3d.png

.. table:: Detected outliers
   :align: center

   +----------+----------+
   | |figO1|  | |figO2|  |
   +----------+----------+


Selection of the Outliers
###########################################
Selecting the outliers can be usefull to remove them from the dataset or for deeper investigation.

.. code:: python

	# Select the outliers
	Xoutliers = X[results['outliers']['y_bool'],:]

	# Select the other set
	Xnormal = X[~results['outliers']['y_bool'],:]


Detect new unseen outliers
###########################################
After fitting a model on the data, you may want to use the model in a later stage to detect outliers on *unseen* data.
Detection of *outliers* is performed in the **transform** function and does not require any additional action. An example is shown in the code block below.

.. code:: python

	# Import libraries
	from pca import pca
	import pandas as pd
	import numpy as np

	# Create dataset with 100 samples
	X = np.array(np.random.normal(0, 1, 500)).reshape(100, 5)

	# Initialize model. Alpha is the threshold for the hotellings T2 test to determine outliers in the data.
	model = pca(alpha=0.05, detect_outliers=['ht2', 'spe'])

	# Fit transform
	model.fit_transform(X)

	# Create 5 outliers
	X_unseen = np.array(np.random.uniform(5, 10, 25)).reshape(5, 5)

	# Transform new "unseen" data into existing PC space.
	PCnew = model.transform(X_unseen)

	# Plot image
	model.biplot(SPE=True, HT2=True)

.. |figO3| image:: ../figs/outliers/biplot_outliers.png

.. table:: Detected outliers
   :align: center

   +----------+
   | |figO3|  |
   +----------+


The ``transform`` function will add the *new* samples to the readily fitted space and update the outlier parameters (default is True). If for example outliers are seen over and over again, they may not be an outlier anymore at a certain point. If you **do not** want to keep all outliers as being an outlier, set the update outlier parameter to False ``update_outlier_params=False``. In case you do not want to add any samples. Let me demonstrate this by example:


.. code:: python

	from pca import pca
	import pandas as pd
	import numpy as np

	# Create dataset with 100 samples
	X = np.array(np.random.normal(0, 1, 500)).reshape(100, 5)

	# Initialize model. Alpha is the threshold for the hotellings T2 test to determine outliers in the data.
	model = pca(alpha=0.05, detect_outliers=['ht2', 'spe'])
	# model = pca(alpha=0.05, detect_outliers=None)

	# Fit transform
	model.fit_transform(X)

	for i in range(0, 10):
	    # Create 5 outliers
	    X_unseen = np.array(np.random.uniform(5, 10, 25)).reshape(5, 5)

	    # Transform new "unseen" data into existing PC space.
	    PCnew = model.transform(X_unseen, row_labels=np.repeat('mapped_' + str(i), X_unseen.shape[0]), update_outlier_params=True)

	    # Scatterplot
	    model.scatter(SPE=True, HT2=True)
	    # Biplot
	    # Model.biplot(SPE=True, HT2=True)


In each iteration, the elips is growing towards the outliers. If you look carefully at the number of outliers, you may notice that the number of outliers is not increasing but even decreasing at a certain point. 

.. |figO4| image:: ../figs/outliers/pca_outliers_iteration.gif

.. table:: Detected outliers
   :align: center

   +----------+
   | |figO4|  |
   +----------+


Detection of outliers without PCA
###########################################
If desired, the outliers can also be detected directly using the hotelling T2 and/or SPE/DmodX functionality.
For demonstration purposes I use the PCs from the results but any other dataset would also work.

.. code:: python

	import pca
	outliers_hot = pca.hotellingsT2(results['PC'].values, alpha=0.05)
	outliers_spe = pca.spe_dmodx(results['PC'].values, n_std=2)


.. include:: add_bottom.add